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This paper analyses the relation between 
the use of similarity in Memory-Based 
Learning and the notion of backed-off 
smoothing in statistical language model- 
ing. We show that the two approaches are 
closely related, and we argue that feature 
weighting methods in the Memory-Based 
paradigm can offer the advantage of au- 
tomatically specifying a suitable domain- 
specific hierarchy between most specific 
and most general conditioning information 
without the need for a large number of pa- 
rameters. We report two applications of 
this approach: PP-attachment and POS- 
tagging. Our method achieves state-of-the- 
art performance in both domains, and al- 
lows the easy integration of diverse infor- 
mation sources, such as rich lexical repre- 
sentations. 



1 Introduction 

Statistical approaches to disambiguation offer the 
advantage of making the most likely decision on the 
basis of available evidence. For this purpose a large 
number of probabilities has to be estimated from a 
training corpus. However, many possible condition- 
ing events are not present in the training data, yield- 
ing zero Maximum Likelihood (ML) estimates. This 
motivates the need for smoothing methods, which re- 
estimate the probabilities of low-count events from 
more reliable estimates. 

Inductive generalization from observed to new 
data lies at the heart of machine-learning approaches 
to disambiguation. In Memory-Based Learning^ 
(MBL) induction is based on the use of similarity 



^The Approach is also referred to as Case-based, 
Instance-based or Exemplar-based. 



In this paper we describe 
how the use of similarity between patterns embod- 
ies a solution to the sparse data problem, how it 
relates to backed-off smoothing methods and what 
advantages it offers when combining diverse and rich 
information sources. 

We illustrate the analysis by applying MBL to 
two tasks where combination of information sources 
promises to bring improved performance: PP- 
attachment disambiguation and Part of Speech tag- 
ging- 

2 Memory-Based Language 
Processing 

The basic idea in Memory-Based language process- 
ing is that processing and learning are fundamen- 
tally interwoven. Each language experience leaves a 
memory trace which can be used to guide later pro- 
cessing. When a new instance of a task is processed, 
a set of relevant instances is selected from memory, 
and the output is produced by analogy to that set. 

The techniques that are used are variants and 
extensions of the classic /c-nearest neighbor {k- 
NN) classifier algorithm. The instances of a task 
are stored in a table as patterns of feature-value 
pairs, together with the associated "correct" out- 
put. When a new pattern is processed, the k nearest 
neighbors of the pattern are retrieved from memory 
using some similarity metric. The output is then de- 
termined by extrapolation from the k nearest neigh- 
bors, i.e. the output is chosen that has the highest 
relative frequency among the nearest neighbors. 

Note that no abstractions, such as grammatical 
rules, stochastic automata, or decision trees are ex- 
tracted from the examples. Rule-like behavior re- 
sults from the linguistic regularities that are present 
in the patterns of usage in memory in combination 
with the use of an appropriate similarity metric. 
It is our experience that even limited forms of ab- 



straction can harm performance on linguistic tasks, 
which often contain many subregularities and excep- 
tions (Daclcmans, 1996). 



2.1 Similarity metrics 

The most basic metric for patterns with symbolic 
features is the Overlap metric given in equations |l| 
and H; where A{X,Y) is the distance between pat- 
terns X and y, represented by n features, Wi is a 
weight for feature i, and d is the distance per fea- 
ture. The fc-NN algorithm with this metric, and 



equal weighting for all features is called iBl (Aha 
et ai, 1991). Usually k is set to 1. 



where: 



S{xt,yt) = if y.i, else 1 



(1) 



(2) 



This metric simply counts the number of 
(mis)matching feature values in both patterns. If 
we do not have information about the importance 
of features, this is a reasonable choice. But if we 
do have some information about feature relevance 
one possibility would be to add linguistic bi as to 
weight or select different features ( Cardie, 199(S| ) . An 
alternative — more empiricist — approach, is to look 
at the behavior of features in the set of examples 
used for training. We can compute statistics about 
the relevance of features by looking at which fea- 
tures are good predictors of the class labels. Infor- 
mation Theory gives us a useful tool for measuring 



feature relevance in this way (Quinlan, 1986; Quin- 
Ian, 1993| ). 



Information Gain (IG) weighting looks at each 
feature in isolation, and measures how much infor- 
mation it contributes to our knowledge of the cor- 
rect class label. The Information Gain of feature / 
is measured by computing the difference in uncer- 
tainty (i.e. entropy) between the situations with- 
out and with knowledge of the value of that feature 
(Equation ||) . 



= H{C) 



E.gy, P{V) X H{C\V) 

s»(/) 



veVf 



(3) 
(4) 



Where C is the set of class labels, Vj is 
the set of values for feature /, and H{C) — 
— X^cec Pi^) 1*^S2 is the entropy of the class la- 
bels. The probabilities are estimated from relative 



frequencies in the training set. The normalizing fac- 
tor si{f) (split info) is included to avoid a bias in 
favor of features with more values. It represents the 
amount of information needed to represent all val- 
ues of the feature (Equation ^). The resulting IG 
values can then be used as weights in equation |^. 
The fc-NN algorithm with this metric is called iBl- 



IG (Daelemans & Van den Bosch, 1992) 



The possibility of automatically determining the 
relevance of features implies that many different and 
possibly irrelevant features can be added to the fea- 
ture set. This is a very convenient methodology if 
theory does not constrain the choice enough before- 
hand, or if we wish to measure the importance of 
various information sources experimentally. 

Finally, it should be mentioned that MB- 
classifiers, despite their description as table-lookup 
algorithms here, can be implemented to work 
fast, using e.g. tree-based indexing into the case- 
base (Daelemans et ai, 1997). 



3 Smoothing of Estimates 

The commonly used method for probabilistic clas- 
sification (the Bayesian classifier) chooses a class 
for a pattern X by picking the class that has the 
maximum conditional probability P{class\X). This 
probability is estimated from the data set by looking 
at the relative joint frequency of occurrence of the 
classes and pattern X. If pattern X is described by 
a number of feature- values xi, . . . , Xm we can write 
the conditional probability as P{class\xi, . . . , Xn)- If 
a particular pattern a;^ , . . . , a;^ is not literally present 
among the examples, all classes have zero ML prob- 
ability estimates. Smoothing methods are needed to 
avoid zeroes on events that could occur in the test 
material. 

There are two main approaches to smoothing: 
count re-estimation smoothing such as the Add-One 
or Good- Turing methods (Church & Gale, 1991), 



and Back-off type methods (Bahl et ai, 1983; Katz 



1987; Chen fc Goodman, 1996; Bamuclsson, 1996) 



We will focus here on a comparison with Back-off 
type methods, because an experimental comparison 
in iChen fc Goodman (1996| ) shows the superiority 



of Back-off based methods over count re-estimation 
smoothing methods. With the Back-off method the 
probabilities of complex conditioning events are ap- 
proximated by (a linear interpolation of) the proba- 
bilities of more general events: 

p{class\X) = Xxp{class\X) + Xx'PiclasslX') 
+ ■ ■ ■ + Xx'^PiclasslX'') (5) 



Where p stands for the smoothed estimate, p for 
the relative frequency estimate, A are interpolation 
weights, X]"=o ~ ^^'^ X ^ X'^ for all i, 
where is a (partial) ordering from most specific 
to most general feature-sets^ (e.g the probabilities 
of trigrams (X) can be approximated by bigrams 
{X') and unigrams {X")). The weights of the lin- 
ear interpolation are estimated by maximizing the 
probability of held-out data (deleted interpolation) 
with the forward-backward algorithm. An alterna- 
tive method to determine the interpolation weights 
without iterative training on held-out data is given 
in Samuelsson (1996). 

We can assume for simplicity's sake that the A^; 
do not depend on the value of X*, but only on i. In 
this case, if F is the number of features, there are 
2^ — 1 more general terms, and we need to estimate 
Ai's for all of these. In most applications the inter- 
polation method is used for tasks with clear order- 
ings of feature-sets (e.g. n-gram language modeling) 
so that many of the 2^ — 1 terms can be omitted 
beforehand. More recently, the integration of infor- 
mation sources, and the modeling of more complex 
language processing tasks in the statistical frame- 
work has increased the interest in smoothing meth- 
ods (Collins fc Brooks, 1995; Ratnaparkhi, 1996 



Magerman, 1994| ; |Ng fc Lee, 1996| ; [Collins, 1996| ). 
For such applications with a diverse set of features 
it is not necessarily the case that terms can be ex- 
cluded beforehand. 

If we let the A^; depend on the value of X^, the 
number of parameters explodes even faster. A prac- 
tical solution for this is to make a smaller number 
of buckets for the X\ e.g. by clustering (see e.g. 



Magerman (1994D ) 



Note that linear interpolation (equation ^) actu- 
ally performs two functions. In the first place, if the 
most specific terms have non-zero frequency, it still 
interpolates them with the more general terms. Be- 
cause the more general terms should never overrule 
the more specific ones, the A^^ for the more general 
terms should be quite small. Therefore the inter- 
polation effect is usually small or negligible. The 
second function is the pure back-off function: if the 
more specific terms have zero frequency, the proba- 
bilities of the more general terms are used instead. 
Only if terms are of a similar specificity, the A's truly 
serve to weight relevance of the interpolation terms. 

If we isolate the pure back-off function of the in- 
terpolation equation we get an algorithm similar to 
the one used in [Collins fc Brooks (19951 ). It IS given 
in a schematic form in Table [1[. Each step consists 



-< X' can be read as X is more specific than X' . 



of a back-off to a lower level of specificity. There 
are as many steps as features, and there are a total 
of 2^ terms, divided over all the steps. Because all 
features are considered of equal importance, we call 
this the Naive Back-off algoviihm.. 

If /(xi, ...,x„) > 0: 

p{c\Xi, ...,Xn) = f{xx,[..,Xr,) 

Else if /(xi, ...,a;„_i, *)-!-... + /(*,X2, ...,a;„) > 0: 

f,(A^ rr \ — /(c,3:i,...,x„_i,*) + ... + /(c,*,a:2,...,a;„) 

p\c\d.i, ...,d.n) — /(a:i,...,x„_i, *)-!-. ..+/(*,X2,...,x„) 

Else if ... : 

p{c\xi, ...,Xn) = ^ 



Else if /(a;i, *, ...,*) + ... + /(*, a;„) > 0: 

-/ I \ /(c,a;i,*,...,*) + ...-t-/(c,*,...,*,a:^) 

p[C\Xi, ...,Xn) — f(xi,*,...,*) + ...+f{*,...,*,x„) 



Table 1: The Naive Back-off smoothing algorithm. 
f{X) stands for the frequency of pattern X in the 
training set. An asterix (*) stands for a wildcard in 
a pattern. The terms at a higher level in the back-off 
sequence are more specific (^) than the lower levels. 

Usually, not all features x are equally important, 
so that not all back-off terms are equally relevant 
for the re-estimation. Hence, the problem of fitting 
the parameters is replaced by a term selection 
task. To optimize the term selection, an evaluation 
of the up to 2^ terms on held-out data is still neces- 
sary. In summary, the Back-off method does not pro- 
vide a principled and practical domain-independent 
method to adapt to the structure of a particular do- 
main by determining a suitable ordering -< between 
events. In the next section, we will argue that a for- 
mal operationalization of similarity between events, 
as provided by MBL, can be used for this purpose. 
In MBL the similarity metric and feature weighting 
scheme automatically determine the implicit back- 
off ordering using a domain independent heuristic, 
with only a few parameters, in which there is no 
need for held-out data. 

4 A Comparison 

If we classify pattern X by looking at its nearest 
neighbors, we are in fact estimating the probabil- 
ity P{class\X), by looking at the relative frequency 
of the class in the set defined by simk{X), where 
sirrikiX) is a function from X to the set of most sim- 



ilar patterns present in the training data^. Although 
the name "fc-nearest neighbor" might mislead us by 
suggesting that classification is based on exactly k 
training patterns, the simk{X) function given by the 
Overlap metric groups varying numbers of patterns 
into buckets of equal similarity. A bucket is defined 
by a particular number of mismatches with respect 
to pattern X. Each bucket can further be decom- 
posed into a number of schemata characterized by 
the position of a wildcard (i.e. a mismatch). Thus 
simk{X) specifies a -< ordering in a Collins & Brooks 
style back-off sequence, where each bucket is a step 
in the sequence, and each schema is a term in the 
estimation formula at that step. In fact, the un- 
weighted overlap metric specifies exactly the same 
ordering as the Naive Back-off algorithm (table ||) . 
In Figure ^ this is shown for a four-featured pat- 
tern. The most specific schema is the schema with 
zero mismatches, which corresponds to the retrieval 
of an identical pattern from memory, the most gen- 
eral schema (not shown in the Figure) has a mis- 
match on every feature, which corresponds to the 
entire memory being best neighbor. 

If Information Gain weights are used in combina- 
tion with the Overlap metric, individual schemata 
instead of buckets become the steps of the back-off 
sequence^. The -< ordering becomes slightly more 
complicated now, as it depends on the number of 
wildcards and on the magnitude of the weights at- 
tached to those wildcards. Let S be the most specific 
(zero mismatches) schema. We can then define the 
-< ordering between schemata in the following equa- 
tion, where A(X, F) is the distance as defined in 
equation ^ 



S' < S" ^ A{S, S') < A(5, S") 



(6) 



Note that this approach represents a type of im- 
plicit parallelism. The importance of the 2^ back-off 
terms is specified using only F parameters — the IG 
weights-, where F is the number of features. This 
advantage is not restricted to the use of IG weights; 
many other weighting schemes exist in the machine 



(1997) 



learning literature (see Wcttschcrcck et 
for an overview). 



Using the IG weights causes the algorithm to rely 
on the most specific schema only. Although in most 
applications this leads to a higher accuracy, because 
it rejects schemata which do not match the most 

^Note that MBL is not limited to choosing the best 
class. It can also return the conditional distribution of 
all the classes. 

^Unless two schemata are exactly tied in their IG 
values. 



important features, sometimes this constraint needs 
to be weakened. This is desirable when: (i) there 
are a number of schemata which are almost equally 
relevant, (ii) the top ranked schema selects too few 
cases to make a reliable estimate, or (iii) the chance 
that the few items instantiating the schema are mis- 
labeled in the training material is high. In such 
cases we wish to include some of the lower-ranked 
schemata. For case (i) this can be done by discrctiz- 
ing the IG weights into bins, so that minor differ- 
ences will lose their significance, in effect merging 
some schemata back into bu ckets. For (ii) and (iii) , 
and for continuous metrics ( IStanfiU fc Wahz, 198e| ; 



[Cost fc Salzberg, 1993 ) which extrapolate from ex- 
actly k neighbors^, it might be necessary to choose a 
k parameter larger than 1. This introduces one addi- 
tional parameter, which has to be tuned on held-out 
data. We can then use the distance between a pat- 
tern and a schema to weight its vote in the nearest 
neighbor extrapolation. This results in a back-off 
sequence in which the terms at each step in the se- 
quence are weighted with respect to each other, but 
without the introduction of any additional weight- 
ing parameters. A weighted voting function that 
was found to work well is due to Dudani (1976 ): the 
nearest neighbor schema receives a weight of 1.0, the 
furthest schema a weight of 0.0, and the other neigh- 
bors are scaled linearly to the line between these two 
points. 

5 Applications 
5.1 PP-attachment 

In this section we describe experiments with MBL 
on a data-set of Prepositional Phrase (PP) attach- 
ment disambiguation cases. The problem in this 
data-set is to disambiguate whether a PP attaches 
to the verb (as in / ate pizza with a fork) or to the 
noun (as in / ate pizza with cheese). This is a dif- 
ficult and important problem, because the semantic 
knowledge needed to solve the problem is very diffi- 
cult to model, and the ambiguity can lead to a very 
large number of interpretations for sentences. 

We used a data-set extracted from the Penn 



Treebank WSJ corpus by Ratnaparkhi et al. (1994). 
It consists of sentences containing the possibly 
ambiguous sequence verb noun-phrase PP. Cases 
were constructed from these sentences by record- 
ing the features: verb, head noun of the first noun 
phrase, preposition, and head noun of the noun 
phrase contained in the PP. The cases were la- 
beled with the attachment decision as made by 



^Note that the schema analysis does not apply to 
these metrics. 



Overlap 

exact match 1 mismatch 2 mismatches 3 mismatches 




Overlap IG 




Figure 1: An analysis of nearest neighbor sets into buckets (from left to right) and schemata (stacked). IG 
weights reorder the schemata. The grey schemata are not used if the third feature has a very high weight 
(see section 5.1). 



Method 


% Accuracy 


iBl (=Naive Back-off) 
IBI-IG 

LcxSpace IG 


83.7 % 
84.1 % 
84.4 % 


Back-off model (Collins & Brooks) 
C4.5 (Ratnaparkhi et al.) 
Max Entropy (Ratnaparkhi et al.) 
Brill's rules (Collins & Brooks) 


84.1 % 
79.9 % 
81.6 % 
81.9 % 



Table 2: Accuracy on the PP-attachment test set. 



the parse annotator of the corpus. So, for the 
two example sentences given above we would get 
the feature vectors ate, pizza, with, fork.V. and 
ate, pizza, with, cheese, N. The data-set contains 
20801 training cases and 3097 separate test cases, 
and was also used in poUins fc Brooks (1995 ) . 



Table || shows a comparison of accuracy on the 
test-set of 3097 cases. We can see that iBl, which 
implicitly uses the same specificity ordering as the 
Naive Back-off algorithm already performs quite well 
in relation to other methods used in the literature. 
Collins & Brooks' (1995) Back-off model is more so- 
phisticated than the naive model, because they per- 
formed a number of validation experiments on held- 
out data to determine which terms to include and, 
more importantly, which to exclude from the back- 
off sequence. They excluded all terms which did 
not match in the preposition! Not surprisingly, the 
84.1% accuracy they achieve is matched by the per- 
formance of IBI-IG. The two methods exactly mimic 
each others behavior, in spite of their huge differ- 
ence in design. It should however be noted that the 
computation of IG-weights is many orders of mag- 
nitude faster than the laborious evaluation of terms 
on held-out data. 



The IG weights for the four features (V,N,P,N) 
were respectively 0.03, 0.03, 0.10, 0.03. This idcnti- 
fies the preposition as the most important feature: 



its weight is higher than the sum of the other three 
weights. The composition of the back-off sequence 
following from this can be seen in the lower part 
of Figure |l|. The grey-colored schemata were effec- 
tively left out, because they include a mismatch on 
the preposition. 



We also experimented with rich lexical represen- 
tations obtained in an unsupervi sed way from word 



co-occurrences in raw WSJ text ( Zavrel fc Vecnstr 



1995| ; Schiitze, 1994 ). We call these representations 
Lexical Space vectors. Each word has a numeric 25 
dimensional vector representation. Using these vec- 
tors, in combination with the IG weights mentioned 
above and a cosine metric, we got even slightly bet- 
ter results. Because the cosine metric fails to group 



the patterns into discrete schemata, it is necessary 
to use a larger number of neighbors {k = 50). The 
result in Table ^ is obtained using Dudani's weighted 
voting method. 

Note that to devise a back-off scheme on the basis 
of these high-dimensional representations (each pat- 
tern has 4 X 25 features) one would need to consider 
up to 2^*^° smoothing terms. The MBL framework 
is a convenient way to further experiment with even 
more complex conditioning events, e.g. with seman- 
tic labels added as features. 

5.2 POS-tagging 

Another NLP problem where combination of differ- 
ent sources of statistical information is an impor- 
tant issue, is POS-tagging, especially for the guess- 
ing of the POS-tag of words not present in the lex- 
icon. Relevant information for guessing the tag of 
an unknown word includes contextual information 
(the words and tags in the context of the word), and 
word form information (prefixes and suffixes, first 
and last letters of the word as an approximation of 
affix information, presence or absence of capitaliza- 
tion, numbers, special characters etc.). There is a 
large number of potentially informative features that 
could play a role in correctly predicting the tag of 
an unknown word ( Ratnaparkhi, 1996| ; Weischedel 



et al, 1993; Daelemans et al, 1996). A priori, it 



is not clear what the relative importance is of these 
features. 

We compared Naive Back-off estimation and MBL 
with two sets of features: 

• PDASS: the first letter of the unknown word (p), 
the tag of the word to the left of the unknown 
word (d), a tag representing the set of possible 
lexical categories of the word to the right of the 
unknown word (a), and the two last letters (s). 
The first letter provides information about cap- 
italisation and the prefix, the two last letters 
about suffixes. 

• PDDDAAASSS: more left and right context fea- 
tures, and more suffix information. 

The data set consisted of 100,000 feature value 
patterns taken from the Wall Street Journal corpus. 
Only open-class words were used during construc- 
tion of the training set. For both ib1-ig and Naive 
Back-off, a 10-fold cross-validation experiment was 
run using both PDASS and pdddaaasss patterns. 
The results are in Table |. The IG values for the 
features are given in Figure ^. 

The results show that for Naive Back-off (and ibI) 
the addition of more, possibly irrelevant, features 





Figure 2: IG values for features used in predicting 
the tag of unknown words. 



quickly becomes detrimental (decrease from 88.5 to 
85.9), even if these added features do make a gener- 
alisation performance increase possible (witness the 
increase with IBI-IG from 88.3 to 89.8). Notice that 
we did not actually compute the 2^° terms of Naive 
Back-off in the PDDDAAASSS condition, as IBI is 
guaranteed to provide statistically the same results. 
Contrary to Naive Back-off and iBl, memory-based 
learning with feature weighting (ib1-IG) manages 
to integrate diverse information sources by differ- 
entially assigning relevance to the different features. 
Since noisy features will receive low IG weights, this 
also implies that it is much more noise-tolerant. 

6 Conclusion 

We have analysed the relationship between Back- 
off smoothing and Memory-Based Learning and es- 
tablished a close correspondence between these two 
frameworks which were hitherto mostly seen as un- 
related. An exception is the use of similarity for al- 
leviating the sparse data problem in language mod- 





IBl, Naive Back-off 


IBl-IG 


PDASS 

PDDDAAASSS 


88.5 (0.4) 
85.9 (0.4) 


88.3 (0.4) 
89.8 (0.4) 



Table 3: Comparison of generalization accuracy of 
Back-off and Memory-Based Learning on prediction 
of category of unknown words. All differences are 
statistically significant (two-tailed paired t-test, p < 
0.05). Standard deviations on the 10 experiments 
are between brackets. 



eling (Essen & Steinbiss, 1992; Brown et al, 1992; 
Dagan et ai, 1994). However, these works differ in 



their focus from our analysis in that the emphasis 
is put on similarity between values of a feature (e.g. 
words), instead of similarity between patterns that 
are a (possibly complex) combination of many fea- 
tures. 

The comparison of MBL and Back-off shows that 
the two approaches perform smoothing in a very sim- 
ilar way, i.e. by using estimates from more general 
patterns if specific patterns are absent in the train- 
ing data. The analysis shows that MBL and Back-off 
use exactly the same type of data and counts, and 
this implies that MBL can safely be incorporated 
into a system that is explicitly probabilistic. Since 
the underlying fc-NN classifier is a method that does 
not necessitate any of the common independence or 
distribution assumptions, this promises to be a fruit- 
ful approach. 

A serious advantage of the described approach, 
is that in MBL the back-off sequence is specified 
by the used similarity metric, without manual in- 
tervention or the estimation of smoothing parame- 
ters on held-out data, and requires only one param- 
eter for each feature instead of an exponential num- 
ber of parameters. With a feature-weighting met- 
ric such as Information Gain, MBL is particularly 
at an advantage for NLP tasks where conditioning 
events are complex, where they consist of the fusion 
of different information sources, or when the data is 
noisy. This was illustrated by the experiments on 
PP-attachment and POS-tagging data-sets. 
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